feat: add AMD GPU (amdflang/OpenMP offload) container #1422

Merged
sbryngelson merged 7 commits into MFlowCode:master from sbryngelson:feat/amd-container
May 11, 2026
@sbryngelson sbryngelson commented May 11, 2026

Summary

  • Dockerfile: adds TARGET=amd branch — downloads AFAR drop (rocm-afar-8873-drop-22.2.0) from repo.radeon.com, installs cmake 3.28 (Ubuntu 22.04 ships 3.22 which doesn't recognise LLVMFlang), builds MPICH 3.4.3 with amdflang as the Fortran compiler so mpi.mod is compiler-compatible, and includes libnuma1/libdrm2/libdrm-amdgpu1 so only --rocm is needed at Apptainer runtime
  • docker.yml: adds amd matrix entry with full build/push/manifest steps; $TAG-amd manifest always, latest-amd on release only
  • CMakeLists.txt: makes Cray-specific MPI/hipfft paths conditional on CRAY_MPICH_INC/CRAY_HIPFORT_LIB being set; falls back to standard find_package(MPI) and find_library(hipfft/amdhip64) with $OLCF_AFAR_ROOT hints so the self-contained container works without any OLCF env vars loaded
  • toolchain: adds amd90a cluster profile (HPCFund gfx90a / MI250); fixes module variable export loop so vars that reference previously exported vars expand in the right order
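The export-order fix in the last bullet can be illustrated with a small shell sketch. This is not the code in toolchain/bootstrap/modules.sh; the function name `export_in_order` and the `DEMO_*` variables are hypothetical. It assumes each entry is a trusted `NAME=VALUE` pair and exports it as soon as it is expanded, so a later value can reference an earlier one.

```shell
# Hypothetical sketch: export NAME=VALUE pairs one at a time so that a later
# value may reference a variable exported by an earlier pair.
export_in_order() {
    local pair name value
    for pair in "$@"; do
        name="${pair%%=*}"    # text before the first '='
        value="${pair#*=}"    # text after the first '='
        # eval expands $VAR references in the value against the current
        # environment, which already holds the earlier exports.
        # Only safe for trusted input such as a checked-in profile file.
        eval "export ${name}=\"${value}\""
    done
}

# Illustrative values only; these are not the real amd90a profile entries.
export_in_order 'DEMO_ROOT=/opt/demo' 'DEMO_LIB=${DEMO_ROOT}/lib'
echo "$DEMO_LIB"
```

If all values were expanded first and exported afterwards, `DEMO_LIB` would be built from a `DEMO_ROOT` that is not yet set, which is the kind of ordering problem the bullet describes.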

Validation

  • Built mfc-amd-final.sif (Apptainer) on top of Ubuntu 22.04 + AFAR + cmake 3.28 + MPICH 3.4.3
  • All 32 dry-run tests passed
  • 1D Sod shock tube ran 1001 time steps on MI250X (gfx90a) GPU via apptainer exec --writable-tmpfs --rocm

Test plan

  • Docker CI build passes for TARGET=amd (compile + dry-run, no GPU runner needed in CI)
  • Existing cpu and gpu builds unaffected
  • latest-amd manifest pushed only on release trigger
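The tagging rule from the summary ($TAG-amd always, latest-amd on release only) can be sketched as a shell helper. The function name `amd_manifest_tags`, the event names, and the version string are assumptions for illustration; the actual logic lives in docker.yml and may be structured differently.

```shell
# Hypothetical helper: print the manifest tags to push for the AMD image.
# $1 = version tag, $2 = name of the triggering event (assumed values:
# "release" for a release, anything else for ordinary builds).
amd_manifest_tags() {
    local tag="$1" event="$2"
    echo "${tag}-amd"              # versioned manifest is always pushed
    if [ "$event" = "release" ]; then
        echo "latest-amd"          # moving tag only on a release
    fi
}

amd_manifest_tags v1.2.3 push
amd_manifest_tags v1.2.3 release
```

Keeping the moving tag behind the release check means ordinary builds never overwrite what latest-amd points to.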

@github-actions

Claude Code Review

Head SHA: 1557fc4

Files changed: 5

  • .github/Dockerfile
  • .github/workflows/docker.yml
  • CMakeLists.txt
  • toolchain/bootstrap/modules.sh
  • toolchain/modules

Findings:

1. OLCF_AFAR_ROOT is set to /opt/ in cpu and gpu Docker images

.github/Dockerfile — the unconditional ENV line after ARG AFAR_VERSION:

ENV OLCF_AFAR_ROOT=/opt/${AFAR_VERSION}

When the AFAR_VERSION build-arg is not supplied (the cpu and gpu matrix entries provide no AFAR_VERSION), Docker expands the ARG as an empty string, producing OLCF_AFAR_ROOT=/opt/. This real system directory is baked into the published cpu and gpu images. CMakeLists.txt uses HINTS "$ENV{OLCF_AFAR_ROOT}/lib" for find_library calls inside the LLVMFlang GPU path; on those images that resolves to /opt/lib, which could yield false positives if a matching library happens to be present there. The fix is to guard the ENV instruction under an ARG-conditional build stage, or give ARG AFAR_VERSION a sentinel default that is guaranteed absent from /opt/.
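The empty-expansion behaviour described in this finding can be reproduced with ordinary shell parameter expansion, which follows the same rule for `${VAR}`. The sketch below is illustrative only: the variable names `unguarded`, `guarded`, and `guarded_amd` are hypothetical, and the `${VAR:+word}` guard is one possible approach, not necessarily the fix adopted in this PR. Dockerfile variable substitution documents a similar `${variable:+word}` modifier, but whether it suits this Dockerfile is an assumption.

```shell
# With no build-arg supplied, the variable is empty and the path collapses
# to the real directory /opt/.
AFAR_VERSION=""
unguarded="/opt/${AFAR_VERSION}"
echo "unguarded: '${unguarded}'"

# ${VAR:+word} yields word only when VAR is set and non-empty, so the
# result stays empty instead of pointing at /opt/.
guarded="${AFAR_VERSION:+/opt/${AFAR_VERSION}}"
echo "guarded:   '${guarded}'"

# With the AFAR drop named in the PR summary, the guard yields the full path.
AFAR_VERSION="rocm-afar-8873-drop-22.2.0"
guarded_amd="${AFAR_VERSION:+/opt/${AFAR_VERSION}}"
echo "amd:       '${guarded_amd}'"
```

An empty value matters here because the `find_library` hint is built as `$ENV{OLCF_AFAR_ROOT}/lib`: with `/opt/` it becomes a real directory, which is the false-positive risk the finding raises.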

2. flang_rt.hostdevice removed from Cray CCE GPU (OpenMP) link path

CMakeLists.txt, the changed hunk around lines 703–710:

-                find_package(hipfort COMPONENTS hip CONFIG REQUIRED)
-                target_link_libraries(${a_target} PRIVATE hipfort::hip hipfort::hipfort-amdgcn flang_rt.hostdevice)

The post-change Cray block (context lines 704–705) now links only hipfort::hip hipfort::hipfort-amdgcn; flang_rt.hostdevice was moved exclusively to the new elseif(CMAKE_Fortran_COMPILER_ID STREQUAL "LLVMFlang") block. Frontier builds using PrgEnv-cray (compiler ID "Cray") with OpenMP GPU offload previously linked flang_rt.hostdevice. If Cray CCE requires that library for device-code linking on AMD GPUs, this removal is a regression. The change should be validated against a live Frontier build before merging.

@sbryngelson sbryngelson force-pushed the feat/amd-container branch 3 times, most recently from f398baa to 0c98537 Compare May 11, 2026 18:37
- Dockerfile: add TARGET=amd branch — downloads AFAR drop from repo.radeon.com,
  installs cmake 3.28 (3.22 doesn't recognise LLVMFlang), builds MPICH 3.4.3
  with amdflang so mpi.mod is compiler-compatible; runtime libs libnuma1/libdrm2
  added so only --rocm is needed at apptainer runtime
- docker.yml: add amd matrix entry + build/push/manifest steps; fix cpu to run
  natively on amd64/arm64 instead of QEMU cross-build; add weekly nightly cron
- CMakeLists.txt: make Cray-specific MPI/hipfft paths conditional on
  CRAY_MPICH_INC/CRAY_HIPFORT_LIB being set; fall back to standard
  find_package(MPI) and find_library(hipfft/amdhip64) so the self-contained
  container image works without any OLCF env vars loaded
- toolchain: add amd90a cluster profile (HPCFund gfx90a / MI250); fix module
  variable export loop so vars that reference previously exported vars expand correctly

ENV OLCF_AFAR_ROOT=/opt/${AFAR_VERSION} expanded to /opt/ in cpu/gpu
images because those builds supply no AFAR_VERSION. Introduce a
dedicated OLCF_AFAR_ROOT build-arg (default "") so cpu/gpu images get
an empty var and only the AMD build passes the real path.
@sbryngelson sbryngelson force-pushed the feat/amd-container branch from 0c98537 to 9c34310 Compare May 11, 2026 18:54
@sbryngelson sbryngelson marked this pull request as ready for review May 11, 2026 18:56
Hard-cut from old names: -gpu → -gpu-nvidia, -amd → -gpu-amd.
Also fixes AMD build step to use lowercase GH_REGISTRY and corrects
the DockerHub environment URL to mflowcode/mfc.
@sbryngelson sbryngelson merged commit f1fd862 into MFlowCode:master May 11, 2026
85 checks passed
codecov Bot commented May 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.95%. Comparing base (50a0807) to head (074c996).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1422   +/-   ##
=======================================
  Coverage   64.95%   64.95%           
=======================================
  Files          72       72           
  Lines       18880    18880           
  Branches     1573     1573           
=======================================
  Hits        12263    12263           
  Misses       5641     5641           
  Partials      976      976           
